Please let me know if you find mistakes (in particular, some characters might have disappeared during Markdown formatting).

1.1 Regular Expressions (aka regexp/regex)

Write regular expressions for (at least some of) the following. The valid/invalid examples are list of literal strings you might paste directly into your Java or Python (or similar) program for testing your regular expressions. Note that the concrete syntax of regular expressions will vary from tool to tool – see more about this below.

For each of them, you may want to:
- Draw the regex as a diagram or state machine
- Visualize it with an online tool such as https://regexper.com/
- Debug it with an online tool such as https://regex101.com/ (some of these may be buggy)
- Try searching and matching with various tools or programming languages

The language of Norwegian sheep. Sheep can say bæ, bæææ, bæbæbææææ, and so on, with at least one 'æ' following a 'b'. Examples:

Valid strings: "bæ", "bæbæ", "bææææbææææbæbæ", "bæbæbæbæbæ"

Invalid strings: "", "b", "bø", "ææææ", "æ", "bbæææ", "quack"
Traditional C comments. A C comment starts with /* and ends with */ and cannot be nested. Java/C++-style //-comments are not supported. Examples:

Valid strings: "/* foo */", "/** */", "/*/* /**/"

Invalid strings: "/*", "// foo", "/* /* */ */", "/*/"

To do this correctly, you may need to use negative lookahead. E.g., x(?!y) means "x not followed by y".
String literals, according to the following rules (C-like): they begin and end with a double quote ("), and any double or single quote must be escaped by a preceding backslash (\). The backslash itself must also be escaped. For example, "foo", "fo\"o" and "fo\\" and "\\\"" are legal strings, and "ba"", "b\jar" are illegal.

Valid strings: "\"foo\"", "\"fo\\\"o\"", "\"\"" (i.e., "foo", "fo\"o", "" without extra quotes)

Invalid strings: "\"", "\"f\"o\"", "\"\\\\\\\"" (i.e., ", "f"o", "\\\" without extra quotes)
Add the following to the string literal definition: a string may contain \xNN or \uNNNN, where N is any hexadecimal digit (this would be used to insert arbitrary characters in hexadecimal ASCII or Unicode notation).
Integer constants: an optional + or -, followed by decimal digits.

Valid strings: "1234", "+1", "-0", "0"

Invalid strings: "+-0", "+", "a", "", "0.0"
Numeric constants in scientific notation, consisting of:
- an optional sign (+ or -)
- zero or more decimal digits
- a decimal point (.)
- zero or more decimal digits
- an optional exponent part, consisting of
  - an upper or lowercase e
  - an optional sign (+ or -)
  - one or more decimal digits
Additional rules: the numbers should contain at least one digit either before or after the decimal point; and at least one of the decimal point or the exponent must be present.

Valid strings: "-2.0", "2e5", "314e-2", ".0", "1."

Invalid strings: ".", "1", "1e2e4", "1e2.0", "1.+10e2"
(Hard!) Internet names and IP addresses. Find the exact syntax rules online or make something up; in general, a host or domain name has one or more labels separated by dots (.), possibly ending with a dot. An IPv4 address in dotted-decimal notation contains four numbers from 0-255, separated by dots (.), and an IPv6 address contains:
- Eight groups of four hexadecimal digits, separated by colons (:)
- Leading zeros may be omitted from groups
- Multiple groups of zeros may be replaced by a double colon (::) – this is only legal once in an address
(Very hard!) Email addresses, as defined by RFC-822 (old) or RFC-5322 (recent). Then search for a solution online and find out why it may not be a particularly good idea.
(Very hard!) URIs, as defined by RFC-3986. Could, for instance, be used together with searching to automatically make links of URIs in a document. (Again, you should probably consider whether this would be a sensible use of regexes.)

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. -- Jamie Zawinski

Extra Regular Expression Shorthands

The following shorthands are supported by many, but not all tools, and may be useful:

\s or [:space:] any space (incl tabs and newlines)
\S any non-space
\w any word-character (alphanumeric plus underscore)
\W any non-word-character
\d or [:digit:] a digit
\D a non-digit
\b word boundary

The [:foo:] style is typically used by standard Unix tools, and only works inside character classes; e.g., [[:space:]]. Perl, Java, Python and many other languages support the \x convention.

1.2 Regular Expressions and Tools

Try your regular expressions from 1.1 on different tools (see below for examples).
Check that they actually work and match what you think they did.
Do you have to modify your expressions to make them fit the syntax of the tool?

Matching vs Searching

There's often a distinction between matching and searching. Matching a regular expression means checking that a string conforms to the grammar specified by the regular expression. Typically the match starts at the beginning of the string; the string may or may not be allowed to have trailing characters that are not matched. Searching will try to identify one (or more) substrings that match a regular expressions.

If you want to be sure your regular expression is matched against the entire string, insert a ^ (matches beginning of string) and $ (end of string) at the beginning/end of the regular expression. Whether 'beginning' or 'end' refers to a line or the entire string depends on your tool (e.g., grep will match lines; String.matches() will match the entire string by default). For example:

Starts with a comment: ^//

An empty string/line: ^$

A string/line of spaces: ^ *$

Regular Expressions in Java

You can match regular expressions in Java using the String.matches() method:

#!java
assertTrue("foo".matches("fo+"));

In more complicated cases, you'll want to use the Pattern class, which lets you precompile a regular expression (makes it faster!) and also lets you do more advanced things.

Java supports a wide range of regular expression operators.

Regular Expressions in Python

In Python, you'll find tools for regular expressions in the re package:

#!python
import re
re.match("fo+", "f")

re.match() gives you a Match object on success, which you can use to find out more about the match (start / end of matching region, information about subgroups (things that are in parentheses in the expression) and so on). If there is no match, you get None.

For example, in Python 2.x, you might test a whole bunch of strings this way:

#!python
for s in ["a", "foo", "bææ"]:
     print s, re.match("fo+", s) != None

Python also has a rich regular expression language, similar to that found in Perl.

Regular Expressions in Unix Tools

You can use grep(1) to find matches in files:

#!bash
grep regexp file1...

Typically you'll have to enclose the regular expression in single quotes, or the shell will expand it and cause confusion.

This will find all (or at least most – it will be confused by comments and line breaks) productions in a Rascal grammar:

#!bash
grep '^[ \t]*syntax[ \t][ \t]*[A-Z][a-zA-Z0-9_]*' MyLang.rsc

Similarly, sed(1) will allow you to do simple text manipulations:

#!bash
sed -e 's/FOO/BAR/g' < input > output

Replaces all instances of FOO by BAR in input and writes the result to output.

Both sed and grep support a limited subset of regular expressions (e.g., they don't support +). The variant egrep(1) is better, but still somewhat different from what you find in Python, Java, Perl, etc.

1.3 Context-Free Grammars

Consider this very simple grammar (in Rascal syntax notation):

start syntax Program = Expr;

syntax Expr
    = ID                  // variables
    | NUM                 // integers
    | Expr "+" Expr       // addition
    | Expr "*" Expr       // multiplication
    | ID "(" Expr ")"     // function call
    | "(" Expr ")"        // parentheses
    ;

The start symbol is Program, and we define a single non-terminal Expr for expressions, with multiple alternatives (separated by |). In addition to the literal operator symbols, we have two terminals ID and NUM. Their definitions are:

// identifiers
//    y !<< x means 'x' must not be preceded by  'y'
//    x !>> y means 'x' must not by followed by 'y'
// so, this means that an identifier is a sequence of one
// or more letters or underscores, with no additional
// letters or underscores before or after
lexical ID = [a-zA-Z_] !<< [a-zA-Z_]+ !>> [a-zA-Z_];

// numbers
lexical NUM = [0-9] !<< [0-9]+ !>> [0-9];

The interesting bits are [a-zA-Z_]+ and [0-9]+. The stuff before !<< and after !>> is just special Rascal notation for saying that there shouldn't be any leftover letters/digits before/after the identifier or number.

Write a few syntactically valid expressions. Make sure you use all the "features" of the language, including nested expressions.
Draw parse trees for your expressions.
Draw parse trees for the following expressions. Do any of the expressions have multiple parse trees? If so, draw all the trees.

    a + b * 2

    f(g(x))

    a + b + c

    (a + b) * 2

    (f((2)))

1.4 Parsing with Rascal

Do this if you have extra time, want to get a head start or don't have anything better to do in your weekend.

Rascal documentation is available from the Rascal Tutor, in particular the Language and Libraries Reference. There is also a cookbook of common solutions.

Follow the Rascal installation instructions. (Rascal should work also with the new Eclipse Neon. Make sure you have a full Java 8 JDK installed, and in use by Eclipse.)
Open the Rascal perspective, with the New Perspective button in the upper right corner.
Create a new Rascal project (File → New → Rascal Project).

Make a Rascal file (File → New → Rascal Module). Name it "Simple", and paste the following code into it (and save it):

module Simple

start syntax Program = Expr;

syntax Expr
    = ID                  // variables
    | NUM                 // integers
    | Expr "+" Expr       // addition
    | Expr "*" Expr       // multiplication
    | ID "(" Expr ")"     // function call
    | "(" Expr ")"        // parentheses
    ;

// identifiers
//    y !<< x means 'x' must not be preceded by  'y'
//    x !>> y means 'x' must not by followed by 'y'
// so, this means that an identifier is a sequence of one
// or more letters or underscores, with no additional
// letters or underscores before or after
lexical ID = [a-zA-Z_] !<< [a-zA-Z_]+ !>> [a-zA-Z_];

// numbers
lexical NUM = [0-9] !<< [0-9]+ !>> [0-9];

Go to the Rascal console, and type import Simple; (if this fails, the console might not be associated with the right project (check for "[project: your-project-name..." on the console tab); if so, close the console and reopen while the project is selected). With the Simple grammar loaded in the console, try parsing all the example expressions you created. Things in the concrete object syntax (the grammar you've supplied, as opposed to the grammar for Rascal itself) are written in back-quotes (a + b). You can tell Rascal which non-terminal to expect by prefixing the concrete syntax fragment with the name of the non-terminal in parentheses:
```
rascal>(Program)`x`
Program: (Program) `x`
rascal>(Expr)`x`
Expr: (Expr) `x`
rascal>(ID)`x`
ID: (ID) `x`
```
Try running unparse(...) on the result (you will need to import ParseTree; first). You should get the input string back.

You can also make a simple visualisation of the parse tree using Rascal's visualisation library. Try this recipe:

import ParseTree;
import vis::Figure;
import vis::ParseTree;
import vis::Render;
render(visParsetree(parse(#Expr, "1+2*3")));

Rascal has build in support for regular expression matching. Regular expressions are written between slashes, and you can match against a string using the := operator. For example, /bæ(æ|-æ)*/ := s.

Rascal Install Tips

Rascal is very picky about having access to a proper JDK (with a compiler) and not just a JRE (the runtime). For this to work, Eclipse must be started with the java command belonging to a JDK and not a JRE.

On Linux and Mac, this seems to work magically as long as you have a JDK installed
On Windows, Eclipse seems to prefer using an installed JRE if possible. To force it to use the JDK, you must
- Uninstall the JRE (using "Add or Remove Programs")
- Add the "bin" folder of your JDK to your PATH environment variable:
  - Locate the JDK (typically, it's in C:\Program Files\Java\jdk1.8.0_...\bin) and copy the location
  - Edit your environment variables, by pressing the start button, searching for "environment" and selecting "Edit the system environment variables"
  - Update your PATH variable -- it's a semicolon-separated list of folders. Paste the bin folder into the path, making sure it's connected to the other folders in the path with a semicolon.
- Start Eclipse again

If Rascal can't find a JDK, it will silently fail. You can see this if you follow the instructions and try to start a console; the console will say (at the top) "<Closed>".

Wiki

inf225 / 2017 / ex / Exercise 1